In another blog post, we explored some features of a structural topic model to analyze our data. Specifically, we examined how topic prevalences were influenced by the year of defense. However, the stm framework also allows for the inclusion of additional variables that could serve as determinants of authors’ topic choices. In this blog article, we ask to what extent gender influenced the topics of French theses. We will train a probabilistic topic model on the French abstracts (abstract_fr) and try to predict the resulting topics from authors’ gender. If you are not familiar with probabilistic topic models, we recommend reviewing the first blog post and the related references for background information.
Explore the data
First, let us examine the data and analyze the distribution of gender in the French dataset. The dataset includes two distinct variables, gender and gender_expanded, which provide information about the gender of individuals (e.g., authors, supervisors).
The gender variable contains the raw information obtained from the idref repository.
The gender_expanded variable is an enhanced version that imputes missing gender values using French census data.
A detailed explanation of these variables can be found in the documentation of the French database available on this website. We will use gender_expanded in this analysis. The gender_expanded variable is derived from the person table, whereas the abstract_fr variable is obtained from the metadata table. To link these variables, we must utilize the edge table, which associates each thesis identifier (thesis_id) with corresponding entities (entity_id).
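The double join described above can be sketched with toy data in base R. The column names mirror the real tables, but the values here are invented for illustration:

```r
# Toy versions of the three tables (invented values)
metadata <- data.frame(thesis_id   = c("t1", "t2"),
                       abstract_fr = c("résumé un", "résumé deux"))
edge     <- data.frame(thesis_id   = c("t1", "t2"),
                       entity_id   = c("e1", "e2"),
                       entity_role = c("author", "author"))
person   <- data.frame(entity_id       = c("e1", "e2"),
                       gender_expanded = c("female", "male"))

# keep author edges, then join metadata -> edge -> person
authors <- edge[edge$entity_role == "author", c("thesis_id", "entity_id")]
linked  <- merge(merge(metadata, authors, by = "thesis_id"),
                 person, by = "entity_id")
linked[, c("thesis_id", "gender_expanded")]
```

The edge table plays the role of a many-to-many link: a thesis can be attached to several entities (author, supervisors), so filtering on entity_role before joining is what guarantees one gender per thesis.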
Show the code
library(DT)

thesis_metadata <- readRDS(here(FR_cleaned_data_path, "thesis_metadata.rds"))
thesis_person <- readRDS(here(FR_cleaned_data_path, "thesis_person.rds"))

# Show a sample of each table for illustration
# Sample data from thesis_metadata
metadata_sample <- thesis_metadata %>%
  select(thesis_id, abstract_fr, year_defence) %>%
  mutate(abstract_fr = str_trunc(abstract_fr, 20)) %>%
  slice_sample(n = 5)

# Sample data from thesis_person
person_sample <- thesis_person %>%
  select(entity_id, entity_name, entity_firstname, gender_expanded) %>%
  slice_sample(n = 5)

# Convert samples to datatables
metadata_table <- datatable(metadata_sample,
  options = list(pageLength = 5, searching = FALSE, lengthChange = FALSE),
  caption = "Metadata")

person_table <- datatable(person_sample,
  options = list(pageLength = 5, searching = FALSE, lengthChange = FALSE),
  caption = "Persons")

# Display tables side by side using HTML
htmltools::browsable(
  htmltools::tagList(
    htmltools::tags$div(style = "display: flex; gap: 20px;",
      htmltools::tags$div(style = "flex: 1;", metadata_table),
      htmltools::tags$div(style = "flex: 1;", person_table)
    )
  )
)

# Join them with the edge table
thesis_edge <- readRDS(here(FR_cleaned_data_path, "thesis_edge.rds"))

authors_doc_id <- thesis_edge %>%
  filter(entity_role == "author") %>%
  select(entity_id, thesis_id) %>%
  unique()

authors_gender <- thesis_person %>%
  select(entity_id, entity_name, entity_firstname, gender_expanded)

# Manage duplicates: when a thesis has duplicates, keep the most recent entry
duplicates <- thesis_metadata %>%
  filter(!is.na(duplicates)) %>%
  group_by(duplicates) %>%
  slice_max(year_defence, n = 1, with_ties = FALSE)

thesis_metadata_no_duplicate <- thesis_metadata %>%
  filter(is.na(duplicates)) %>%
  bind_rows(duplicates)

# We can now select the relevant variables for stm
data <- thesis_metadata_no_duplicate %>%
  # keep relevant columns
  select(thesis_id, abstract_fr, year_defence) %>%
  # keep only documents with an abstract
  filter(!is.na(abstract_fr)) %>%
  # double join to add a gender to each document
  left_join(authors_doc_id, by = "thesis_id") %>%
  left_join(authors_gender, by = "entity_id")
Table 1: Sample of metadata and person tables
We now evaluate the distribution of our data. To do so, we filter out entries where the abstract_fr or the authors’ gender_expanded information is unavailable. Figure 1 illustrates the distribution of theses with a French abstract over time, categorized by gender.
data <- data %>%
  # filter NA abstract and gender
  filter(!is.na(gender_expanded), !is.na(year_defence))

gg <- data %>%
  group_by(gender_expanded, year_defence) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  mutate(tooltip = paste("Year:", year_defence,
                         "<br>Number of authors:", n,
                         "<br>Gender:", gender_expanded)) %>%
  ggplot(aes(x = as.integer(year_defence),
             y = n,
             fill = gender_expanded,
             text = tooltip)) +
  geom_col() +
  labs(x = "", y = "Number of authors", fill = "Gender") +
  scale_fill_brewer(palette = "Set3") +
  theme_light()

plotly::ggplotly(gg, tooltip = "text") %>%
  plotly::config(displayModeBar = FALSE)
Figure 1: Distribution of authors by gender
Show the code
distribution <- data %>%
  group_by(gender_expanded, year_defence) %>%
  summarise(n = n()) %>%
  ungroup()

distribution %>%
  DT::datatable(
    extensions = 'Buttons',
    options = list(
      dom = 'Blfrtip',
      buttons = c('excel', 'csv'),
      pageLength = 12
    )
  )
Table 2: Gender distribution by year
The practice of including an abstract became common in the early 1980s and was formally institutionalized in 1984 with the Savary reform. While some abstracts exist for theses defended earlier, a closer examination reveals that these were added by librarians who created the entries. Therefore, it is reasonable to filter out theses defended before 1980.
To run our probabilistic topic model, we apply the usual pre-processing steps, already described in this post. The primary difference here is the inclusion of a few custom stopwords to remove highly redundant terms frequently found in abstracts, such as chapter (chapitre) or thesis (thèse).
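The stopword step itself is simple to illustrate. A minimal sketch with an invented token vector (the real list combines several stopwords-iso and snowball sources before adding the custom terms):

```r
# Combine a generic stopword list with corpus-specific terms (toy example)
base_stop_words   <- c("le", "la", "les", "de", "un", "une")
custom_stop_words <- c("chapitre", "thèse", "montrons", "montre", "analyse")
stop_words <- c(base_stop_words, custom_stop_words)

tokens <- c("le", "chapitre", "étudie", "la", "croissance", "montre")
kept   <- tokens[!tokens %in% stop_words]
kept
```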
Show the code
# First, filter out theses defended before 1980
data <- data %>%
  filter(year_defence > 1979)

# The rest of the script is a personal stm::textProcessor()

# TOKENIZATION
library(spacyr)
#' install spacy if necessary
#' `spacy_install(force = TRUE)`
#' `spacy_download_langmodel("fr_core_news_lg", force = TRUE)`
spacy_initialize("fr_core_news_lg")

parsed <- data %>%
  # some pre-cleaning of abstracts
  mutate(abstract_fr = str_to_lower(abstract_fr),
         abstract_fr = str_replace_all(abstract_fr, "", " ") %>%
           str_replace_all(., "", " "),
         abstract_fr = str_remove_all(abstract_fr, "^\\(?Résumé"),
         abstract_fr = str_replace_all(abstract_fr, "-", " "),
         abstract_fr = str_squish(abstract_fr)) %>%
  pull(abstract_fr) %>%
  # identify words
  spacyr::spacy_parse(multithread = TRUE)

id <- data %>%
  distinct(thesis_id) %>%
  ungroup() %>%
  mutate(doc_id = paste0("text", 1:n()))

parsed <- parsed %>%
  left_join(id, join_by(doc_id)) %>%
  select(-doc_id)

saveRDS(parsed, here(website_data_path, "parsed.rds"), compress = TRUE)

# FILTER TOKENS
parsed <- readRDS(here(website_data_path, "parsed.rds"))

# prepare stop words
stop_words <- bind_rows(
  get_stopwords(language = "fr", source = "stopwords-iso"),
  get_stopwords(language = "fr", source = "snowball"),
  # some abstracts have English expressions
  get_stopwords(language = "en", source = "snowball")
) %>%
  distinct(word) %>%
  pull(word)

# add custom stopwords
custom_stop_words <- c("chapitre", "thèse", "montrons", "montre", "analyse")
stop_words <- c(stop_words, custom_stop_words)

parsed_filtered <- parsed %>%
  # count original ids
  mutate(original_count = n_distinct(thesis_id)) %>%
  # filter empty tokens and track removed ids
  filter(!pos %in% c("PUNCT", "SYM", "SPACE")) %>%
  mutate(after_filter1 = n_distinct(thesis_id)) %>%
  { message("Doc removed after filter: ", unique(.$original_count) - unique(.$after_filter1)); . } %>%
  # remove special characters
  filter(!token %in% c("-", "δ", "α", "σ", "γ", "東一")) %>%
  mutate(after_filter2 = n_distinct(thesis_id)) %>%
  { message("Doc removed after filter: ", unique(.$after_filter1) - unique(.$after_filter2)); . } %>%
  # remove any digit token (including those with letters after digits such as 12eme)
  filter(!str_detect(token, "^\\d+.*$")) %>%
  mutate(after_filter3 = n_distinct(thesis_id)) %>%
  { message("Doc removed after filter: ", unique(.$after_filter2) - unique(.$after_filter3)); . } %>%
  # remove French contractions (l', d') and residual punctuation
  mutate(token = str_remove_all(token, "^[ld]'"),
         token = str_remove_all(token, "[[:punct:]]")) %>%
  # filter single letters
  filter(str_detect(token, "[[:letter:]]{2}")) %>%
  mutate(after_filter4 = n_distinct(thesis_id)) %>%
  { message("Doc removed after filter: ", unique(.$after_filter3) - unique(.$after_filter4)); . } %>%
  # filter list of stop words
  filter(!token %in% stop_words) %>%
  mutate(after_filter5 = n_distinct(thesis_id)) %>%
  { message("Doc removed after filter: ", unique(.$after_filter4) - unique(.$after_filter5)); . } %>%
  # create bigrams
  group_by(thesis_id, sentence_id) %>%
  mutate(bigram = ifelse(token_id < lead(token_id),
                         str_c(token, lead(token), sep = "_"),
                         NA)) %>%
  ungroup()

# SELECT BIGRAMS
bigrams <- parsed_filtered %>%
  select(thesis_id, sentence_id, token_id, bigram) %>%
  filter(!is.na(bigram)) %>%
  mutate(window_id = 1:n()) %>%
  add_count(bigram) %>%
  filter(n > 10) %>%
  separate(bigram, c("word_1", "word_2"), sep = "_") %>%
  filter(if_all(starts_with("word"), ~ !(. %in% stop_words)))

bigram_pmi_values <- bigrams %>%
  pivot_longer(cols = starts_with("word"), names_to = "rank", values_to = "word") %>%
  mutate(word = paste0(rank, "_", word)) %>%
  select(window_id, word, rank) %>%
  widyr::pairwise_pmi(word, window_id) %>%
  arrange(item1, pmi) %>%
  filter(str_detect(item1, "word_1")) %>%
  mutate(across(starts_with("item"), ~ str_remove(., "word_(1|2)_"))) %>%
  rename(word_1 = item1,
         word_2 = item2,
         pmi_bigram = pmi) %>%
  group_by(word_1) %>%
  mutate(rank_pmi_bigram = 1:n())

bigrams_to_keep <- bigrams %>%
  left_join(bigram_pmi_values) %>%
  filter(pmi_bigram > 3) %>%
  mutate(bigram = paste0(word_1, "_", word_2)) %>%
  distinct(bigram) %>%
  mutate(keep_bigram = TRUE)

parsed_final <- parsed_filtered %>%
  left_join(bigrams_to_keep) %>%
  mutate(token = if_else(keep_bigram, bigram, token, missing = token),
         token = if_else(lag(keep_bigram), lag(bigram), token, missing = token),
         token_id = if_else(lag(keep_bigram), token_id - 1, token_id, missing = token_id)) %>%
  distinct(thesis_id, sentence_id, token_id, token)

term_list <- parsed_final %>%
  rename(term = token)

saveRDS(term_list, here(website_data_path, "term_list.rds"))

# PREPARE STM INPUT
term_list <- readRDS(here(website_data_path, "term_list.rds"))

# stm object with covariates
metadata <- term_list %>%
  distinct(thesis_id) %>%
  left_join(data, by = "thesis_id") %>%
  mutate(year_defence = as.numeric(year_defence)) %>%
  distinct(thesis_id, abstract_fr, year_defence, gender_expanded) %>%
  # filter lines with NA covariates
  filter(!is.na(abstract_fr), !is.na(gender_expanded), !is.na(year_defence))

# transform the list of terms into an stm object
corpus_in_dfm <- term_list %>%
  # remove observations deleted by the metadata filter
  filter(thesis_id %in% metadata$thesis_id) %>%
  add_count(term, thesis_id) %>%
  cast_dfm(thesis_id, term, n)

corpus_in_stm <- quanteda::convert(corpus_in_dfm, to = "stm", docvars = metadata)

saveRDS(corpus_in_stm, here(website_data_path, "corpus_in_stm.rds"))
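The bigram step above keeps only pairs whose pointwise mutual information (PMI) exceeds 3, i.e. pairs that co-occur far more often than their individual frequencies would predict. A minimal numeric sketch of the criterion, with invented counts:

```r
# pmi(x, y) = log( p(x, y) / (p(x) * p(y)) ), natural log assumed
n_windows <- 1000  # total bigram windows (invented)
n_x       <- 20    # windows containing the first word
n_y       <- 15    # windows containing the second word
n_xy      <- 12    # windows containing the pair

pmi <- log((n_xy / n_windows) / ((n_x / n_windows) * (n_y / n_windows)))
pmi  # ≈ 3.69, above the threshold of 3, so this pair would be kept
```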
We can then run the topic model using gender_expanded and year_defence as covariates of the prevalences \(\theta_{1:K}\). In simple terms, instead of assuming that all topics are equally distributed across documents, the inclusion of covariates ensures that topic prevalence is shaped by the structure of the data reflected in these covariates. The covariates constrain the range of possible topic prevalences (\(\theta_d\)) for each document, aligning them with patterns in the metadata.
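Formally, the prevalence side of the model can be sketched as follows (a standard presentation of the stm prevalence component; here \(X_d\) stacks the gender_expanded and year_defence covariates for document \(d\)):

```latex
\eta_d \sim \mathcal{N}(X_d \gamma, \Sigma), \qquad
\theta_{d,k} = \frac{\exp(\eta_{d,k})}{\sum_{j=1}^{K} \exp(\eta_{d,j})}
```

With \(\gamma = 0\), all documents share the same prior mean; with covariates, documents with different genders or years of defense are allowed different expected topic proportions.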
For a corpus of 11,000 documents, starting with \(K=100\) (the number of topics) provides a reasonable basis for exploration, balancing granularity (enough topics to capture diversity) and interpretability (avoiding overly fine-grained or redundant topics).
Figure 2 presents the top 30 topics ranked by average prevalence. For each topic, the most probable words from the topic content (based on the \(\beta_{1:K}\) distributions) are assigned. Unsurprisingly, the top topics closely align with common subfields in economics, such as development economics, monetary economics, and the economics of innovation.
Show the code
stm <- readRDS(here(website_data_path, "stm.rds"))

label_topic <- labelTopics(stm, n = 7)

top_terms_prob <- label_topic[[1]] %>%
  as_tibble() %>%
  reframe(topic_label_prob = pmap_chr(., ~ paste(c(...), collapse = ", "))) %>%
  mutate(topic = row_number())

# broom::tidy calls the prevalence matrix "gamma"; stm calls it theta
theta <- broom::tidy(stm, matrix = "gamma") %>%
  left_join(top_terms_prob, by = "topic")

#### plot summary of topics ####
theta_mean <- theta %>%
  group_by(topic, topic_label_prob) %>%
  summarise(theta = mean(gamma)) %>%
  ungroup() %>%
  mutate(topic = reorder(topic, theta)) %>%
  slice_max(theta, n = 30)

theta_mean %>%
  ggplot() +
  geom_segment(aes(x = 0, xend = theta, y = topic, yend = topic),
               color = "black",
               size = 0.5) +
  geom_text(aes(x = theta, y = topic, label = topic_label_prob),
            hjust = -.01,
            size = 3) +
  scale_x_continuous(expand = c(0, 0),
                     limits = c(0, max(theta_mean$theta) * 3.04)) +
  ggthemes::theme_hc() +
  theme(plot.title = element_text(size = 8)) +
  labs(x = "Average prevalence by document",
       y = NULL,
       caption = "Note: words are most probable words")
Figure 2: Top 30 topics by prevalence
Effect of gender
Now that we have estimated our topics and ensured their interpretability, we may ask to what extent the author’s gender predicts the topic prevalence for a given document. Given the increasing number of female authors over time, it is more likely that recent topics will exhibit a higher proportion of female authorship. Therefore, we include the year of defense in the regression to control for this increasing proportion.
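A toy simulation (invented data, not the thesis corpus) illustrates why this control matters: when both the share of female authors and a topic’s prevalence rise over time, a regression of prevalence on gender alone picks up a spurious effect that largely disappears once the year is included. In stm itself this regression is run with estimateEffect().

```r
set.seed(1)
n      <- 5000
year   <- sample(1980:2020, n, replace = TRUE)
# the probability of a female author grows with the year
female <- rbinom(n, 1, plogis(-40 + 0.02 * year))
# the topic's prevalence depends on the year only, not on gender
theta  <- 0.01 * (year - 1980) + rnorm(n, sd = 0.1)

naive      <- coef(lm(theta ~ female))["female"]         # biased upward
controlled <- coef(lm(theta ~ female + year))["female"]  # close to zero
c(naive = unname(naive), controlled = unname(controlled))
```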